Abstract

Distance based algorithms are a common technique in the construction of phylogenetic trees from 
taxonomic sequence data. The first step in the implementation of these algorithms is the calcu- 
lation of a pairwise distance matrix to give a measure of the evolutionary change between any 
pair of the extant taxa. A standard technique is to use the log det formula to construct pairwise 
distances from aligned sequence data. We review a distance measure valid for the most general 
models, and show how the log det formula can be used as an estimator thereof. We then show that 
the foundation upon which the log det formula is constructed can be generalized to produce a pre- 
viously unknown estimator which improves the consistency of the distance matrices constructed 
from the log det formula. This distance estimator provides a consistent technique for constructing 
quartets from phylogenetic sequence data under the assumption of the most general Markov model 
of sequence evolution.

1 Introduction



The distance based approach to phylogenetic reconstruction using the neighbor joining algorithm
is a commonly used technique [?]. Under the assumptions of a Markov model of sequence
evolution, the phylogenetic relationship is uniquely reconstructible from (suitably defined)
pairwise distances [?]. The approach relies crucially upon the calculation of distance matrices
from aligned sequence data which give a measure of the pairwise evolutionary distance between 
the extant taxa under consideration. As far as tree building algorithms are concerned it is re- 
quired that the distances are strictly linearly related to the sum of the (theoretical) edge lengths 
of the phylogenetic tree, and that the parameters of the linear relation do not vary across the 
tree. It is essential to the analysis that the measure of distance chosen has both biological and 
statistical as well as mathematical significance. If one assumes the standard Markov model, the 
edge lengths of a phylogenetic tree can be taken mathematically to be a quantity which we refer 
to as the stochastic distance. (For mathematical discussion of this quantity see Goodman [?], who
refers to the stochastic distance as intrinsic time, and see also Barry and Hartigan [?], who gave a
biological interpretation.) Under the assumptions of a general Markov model the logdet formula
is commonly used to obtain pairwise distances. Further, if one may assume a stationary process
then the logdet formula can be modified to give an estimate of the actual stochastic distance [12].
(That is, the constants of the linear relation are set by the stationarity assumption.) 

Distance based methods and, consequently, the logdet formula are often used in favour
of other methods (such as maximum likelihood) in cases where there has been significant
compositional heterogeneity during the evolutionary history. The theoretical basis which motivates
this usage was presented by Steel [22] and is discussed in Lockhart, Steel, Hendy and Penny [12]
and Gu and Li. More recently, Jermiin, Ho, Ababneh, Robinson and Larkum published a
simulation study which confirms that the logdet outperforms other techniques in this case.
Lockhart et al. showed that by using the assumption that the base composition remains close to
constant, the logdet formula can be modified to give an estimate of the actual stochastic distance.
However, as will be shown, in both its original and modified form the logdet formula includes
an approximation which is crucially dependent upon the compositional heterogeneity remaining
minimal. The effectiveness of the logdet formula in correctly reconstructing the phylogenetic history
when there has been significant compositional heterogeneity is thus brought into question. Hence
there is a contradictory state of affairs between the theoretical basis of the logdet and the cir- 
cumstances under which it is implemented. In this paper we will generalize the logdet formula in 
such a way that this dependence upon base composition is truly absent. 

A disadvantage of the logdet formula is that it uses only pairwise sequence data and is 
blind to the fact that extra information regarding pairwise distances can be obtained from the 
sequence data of additional taxa. Felsenstein [6] mentions that it is surprising that distance
techniques work at all given that they ignore the extra information in higher order alignments. This
paper details exactly how the log det formula can be improved upon by taking functions of aligned 
sequence data for three taxa at a time. It may seem counter-intuitive that consideration of a 
third taxon can impart information regarding the evolutionary distance between two taxa, but 
it is the case that by considering a third taxon the logdet formula can be refined. This result 
depends crucially upon the fact that, as is somewhat trivially the case for two taxa, there is only 
one possible (unrooted) tree topology relating three taxa. (For discussion of what a tree topology 
is see Chapter 5.) It is possible to refine the logdet formula by considering the respective 
distance to an arbitrary third taxon. (The reader should note that the use of triplet sequence data
in the problem of reconstructing the Markov model was also considered in [1] and [?]. The
approach discussed in the present work is original in the sense that triplets of the aligned sequences
are being used explicitly in a distance method, and follows on from the theoretical discussions of
[?].)

A complication arises regarding the total stochastic distance between leaves and the placement
of the root of a phylogenetic tree. It turns out that if we define phylogenetic trees of identical
topology to be equivalent if they give identical probability distributions then we find that the
total stochastic distance between leaves is not, in general, left unchanged as we move the root of
the tree. The equivalence class so defined provides a generalization of Felsenstein's pulley principle
[5] and was first presented in Steel, Szekely and Hendy [22]. The fact that the stochastic distance
is not left unchanged is a surprising result and has important implications regarding the interpre- 
tation of the edge lengths of phylogenetic trees defined under the Markov model. In particular 
this result implies that the log det technique is an inconsistent estimator of pairwise distances on 
phylogenetic trees. It is the purpose of this work to present a new estimator which is consistent 
in the case of phylogenetic quartets. We are motivated to present this construction of quartet 
distance matrices by the interest in phylogenetic reconstruction of large trees from the correct 
determination of the set of (?) quartets [?].



2 The general Markov model on phylogenetic trees and stochastic distance



It is standard to model sequence evolution as a stochastic process. The discrete space $\mathcal{K}$ is
associated with molecular units which we refer to as bases and we define $n := |\mathcal{K}|$. For example, in
the case of DNA sequences we have $\mathcal{K} = \{A, G, C, T\}$ and $n = 4$. We then consider each instance
of a base to be a random variable $X \in \mathcal{K}$ and the stochastic time evolution of sequence data is
modelled as a continuous time Markov chain (CTMC) such that

$$\frac{d}{dt}\mathbb{P}(X(t) = i) = \sum_{j} \mathbb{P}(X(t) = j)\, q_{ji}(t), \qquad i, j \in \mathcal{K}. \tag{1}$$

The $q_{ij}(t)$ are called rate parameters and satisfy the relations

$$q_{ij}(t) \geq 0, \quad \forall\, i \neq j; \qquad q_{ii}(t) = -\sum_{j \neq i} q_{ij}(t). \tag{2}$$

We define $Q(t) = [q_{ij}(t)]_{i,j \in \mathcal{K}}$ as the rate matrix associated with the Markov chain. The Markov
chain is called homogeneous if the rate matrix is time independent. The results presented in this
paper are equally valid for inhomogeneous models where the rate matrix is time dependent and
so we allow for this generality throughout. It is also common to impose further symmetries upon
the rate matrix such as the Jukes Cantor and Kimura 3ST models [14]. However, the results
presented here are again valid for any rate matrix satisfying (2), and hence no restriction upon
the rate parameters is made.

For notational simplicity we will write $\pi_i(t) := \mathbb{P}(X(t) = i)$ and, given an initial
distribution $\pi_i(0)$, write solutions of (1) as

$$\pi_i(t) = \sum_{j} \pi_j(s)\, m_{ji}(s,t), \qquad 0 \leq s < t;$$

where $m_{ij}(s,t) := \mathbb{P}(X(t) = j \mid X(s) = i)$ are the transition probabilities of the chain. We define the
matrix $M(s,t) = [m_{ij}(s,t)]_{i,j \in \mathcal{K}}$ such that in the homogeneous case the transition probabilities
only depend on the difference $(t - s)$ and can be represented in terms of the rate matrix as

$$M(s,t) = M(0, t-s) = e^{Q(t-s)} := \sum_{k=0}^{\infty} \frac{Q^{k}(t-s)^{k}}{k!}.$$



In the inhomogeneous case there are several representations available for the matrix of transition
probabilities (for details see [?]). The representation that is of most use to us here is the time
ordered product, which can be written for sufficiently small $\delta t$ in the approximate form

$$\begin{aligned} M(s,t) &\simeq M(s, s+\delta t)\, M(s+\delta t, s+2\delta t) \cdots M(t-2\delta t, t-\delta t)\, M(t-\delta t, t) \\ &= e^{Q(s)\delta t}\, e^{Q(s+\delta t)\delta t} \cdots e^{Q(t-2\delta t)\delta t}\, e^{Q(t-\delta t)\delta t}. \end{aligned} \tag{3}$$

From this solution it is easy to show that the backward and forward Kolmogorov equations:

$$\frac{\partial M(s,t)}{\partial s} = -Q(s)M(s,t), \qquad \frac{\partial M(s,t)}{\partial t} = M(s,t)Q(t), \tag{4}$$

are satisfied as required of any CTMC.



2.1 Stochastic distance 



In this work we will be interested in the assignment of edge lengths to phylogenetic trees. To this 
end we consider the rate of change of base changes at time $s$:$^{1}$

$$\lambda(s) := \sum_{i} \left. \frac{\partial\, \mathbb{P}(X(t) \neq i \mid X(s) = i)}{\partial t} \right|_{t=s}.$$

By considering (2) and (4) this quantity can be explicitly expressed using the rate parameters:

$$\lambda(s) = -\sum_{i} q_{ii}(s) = -\operatorname{tr} Q(s).$$

From these considerations we define the stochastic distance to be given by the expression

$$\omega(s,t) := \int_{s}^{t} \lambda(u)\, du.$$

By considering the time ordered product representation (3) and the Jacobi identity $\det e^{X} = e^{\operatorname{tr} X}$,
we find that the stochastic distance can be directly related to the transition probabilities of the
Markov chain:

$$\omega(s,t) = -\ln \det M(s,t). \tag{5}$$

Our assignment of edge lengths will take the Markov matrix associated with each edge and set 
the edge length equal to the stochastic distance. 
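As a minimal numerical sketch of (5) in the inhomogeneous case, the following block builds the time ordered product (3) for a hypothetical time dependent rate matrix $Q(u) = Q_0 + uQ_1$ and compares $-\ln\det M(s,t)$ with the integral of $\lambda(u) = -\operatorname{tr} Q(u)$. The matrices `Q0` and `Q1`, the step count, and the use of interval midpoints (rather than the left endpoints written in (3)) are illustrative assumptions and are not taken from the text.

```python
import math

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_exp(m, terms=30):
    """Truncated power series for the matrix exponential (adequate for small entries)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = mat_mul(term, [[x / k for x in row] for row in m])
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def det(m):
    """Determinant by Laplace expansion along the first row (fine for small n)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

# Hypothetical rate matrices: rows sum to zero, off-diagonal entries are non-negative.
Q0 = [[-0.30, 0.10, 0.20], [0.05, -0.15, 0.10], [0.20, 0.20, -0.40]]
Q1 = [[-0.10, 0.10, 0.00], [0.00, -0.20, 0.20], [0.10, 0.00, -0.10]]

def rate_matrix(u):
    return [[Q0[i][j] + u * Q1[i][j] for j in range(3)] for i in range(3)]

def transition_matrix(s, t, steps=200):
    """Time ordered product of short-time exponentials, evaluated at interval midpoints."""
    dt = (t - s) / steps
    M = [[float(i == j) for j in range(3)] for i in range(3)]
    for k in range(steps):
        Qu = rate_matrix(s + (k + 0.5) * dt)
        M = mat_mul(M, mat_exp([[x * dt for x in row] for row in Qu]))
    return M

s, t = 0.0, 1.5
M = transition_matrix(s, t)
omega_from_det = -math.log(det(M))
# lambda(u) = -tr Q(u) = 0.85 + 0.4 u for the matrices above, integrated from s to t.
omega_from_integral = 0.85 * (t - s) + 0.4 * (t * t - s * s) / 2
```

Because each factor satisfies the Jacobi identity, the determinant of the product depends only on the traces of the $Q(u)$, which is why the two values agree even though the factors do not commute.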

The relation (5) is known in various guises in both the mathematical and phylogenetic
literature [?] and, as will be confirmed in the next section, is the basis of the log det formula.
It should also be noted that (5) will remain positive and finite because $\omega(s,s) = 0$, $\lambda(s) \geq 0$ and
the integral $\int_{0}^{T} \lambda(t)\,dt$ is not expected to diverge.$^{2}$

1 It is standard to include a factor of $n^{-1}$ in this definition. However, this factor clutters the consequent formulae
and here we do not include it as it has no consequence for the foregoing discussion and can always be incorporated
into the analysis later.

2 There are two cases where the integral may diverge, but we can safely exclude these possibilities as follows.
(i) $\lambda(t)$ may be a badly behaved function. We can reject this possibility outright in phylogenetics as there is every
reason to expect the rate parameters to change smoothly with time. (ii) $T \to \infty$. We can safely ignore this
possibility as we will be assuming that the divergence times of the Markov chain are sufficiently small such that
the phylogenetic historical signal is still obtainable.

Figure 1: Phylogenetic tree of four taxa
2.2 Phylogenetic trees 

The remaining task is to model the case of multiple taxa evolving under a stochastic process. Effec- 
tively the model consists of multiple copies of the random variable X(t) taken as a generalization 
(via a tree structure) of a cartesian product and then modelled collectively as a CTMC. The reader
is referred to [?] for a more extended discussion of the model. Here we keep the presentation to
a minimum while allowing for the introduction of some essential notation and concepts.

A tree is a connected graph without cycles and consists of a set of vertices and edges
$T = (V, E)$. Vertices of degree one are called leaves and we partition the set of vertices as
$V = L \cup N$ where $L$ is the set of leaves and $N$ is the set of internal vertices. We direct each
edge of $T$ away from a distinguished vertex, $\pi$, known as the root of the tree. Consequently,
a given edge lying between vertices $u$ and $v$ is specified as an ordered pair $e = (u, v)$, where $u$
lies on the (unique) path between $v$ and $\pi$. The stochastic phylogenetic model is then made by
assigning a set of random variables $\{X_e,\ e \in E\}$ to each edge of the tree; these random variables
are assumed to be conditionally independent and individually satisfy the properties of a CTMC.
Taking a distribution at the root of the tree, $\{\mathbb{P}(X_\pi = i) := \pi_i,\ i \in \mathcal{K}\}$, completes the phylogenetic
tree. The interpretation of a phylogenetic tree is that the probability distribution at each leaf 
is associated with the observed sequence of a single taxon and the joint probability distribution 
across a number of leaves is associated with the aligned sequences of the same number of taxa. 

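For example in Figure 1 we present the tree consisting of 4 taxa which has probability
distribution

$$p_{i_1 i_2 i_3 i_4} = \sum_{j,k} \pi_j\, m^{(5)}_{jk}\, m^{(1)}_{j i_1} m^{(2)}_{j i_2} m^{(3)}_{k i_3} m^{(4)}_{k i_4},$$

where

$$p_{i_1 i_2 i_3 i_4} := \mathbb{P}(X_1 = i_1, X_2 = i_2, X_3 = i_3, X_4 = i_4)$$

and we refer to these quantities as pattern probabilities.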



3 Pairwise distance measures 



In this section we will derive and discuss a standard approach to the construction of distance 
matrices. (For an excellent perspective of the various measures of phylogenetic pairwise distance
see [?].) A distance matrix, $\phi = [\phi_{ab}]_{a,b \in L}$, is constructed from the aligned sequence data of
multiple extant taxa such that each entry gives a suitable estimate of the distance between a given
pair of taxa. The mathematical conditions on the $\phi_{ab}$ are the standard conditions of a distance
function as well as the four point condition [23] (which is required for the distance measure to be
consistent with the tree structure):

$$\begin{aligned} & \phi_{ab} \geq 0, \\ & \phi_{ab} = 0 \ \text{iff}\ a = b, \\ & \phi_{ab} = \phi_{ba}, \\ & \phi_{ab} + \phi_{cd} \leq \max\{\phi_{ac} + \phi_{bd},\ \phi_{ad} + \phi_{bc}\}; \quad \forall\, a, b, c, d \in L. \end{aligned} \tag{6}$$

There are no further conditions required upon $\phi$ for it to give a unique tree reconstruction [23].
However it is of course desirable for the distance measure to have a well defined biological
interpretation. To this end, for a given edge $e$, we define the edge length, $\omega_e$, which we set to be the
stochastic distance taken from the Markov model:

$$\omega_e = -\ln \det M_e.$$

It is then apparent that any significant estimate of pairwise distance must statistically be expected
to converge to a value which is linearly related to the sum of the stochastic distances lying on the
(unique) path between the two taxa under consideration. It should be clear that such a measure
will satisfy the relations (6). It is crucial to the performance of the distance measure under a tree
building algorithm that the parameters of the linear relation are expected to be constant for all
pairs of taxa. That is, given the unique path between leaf $a$ and $b$, $P(T; a, b)$, we are demanding
that statistically we have the following convergence:

$$\phi_{ab} \to \alpha\, \omega(a,b) + \beta,$$

where

$$\omega(a,b) := \sum_{e \in P(T; a, b)} \omega_e$$

and $\alpha$ and $\beta$ are expected to be independent of $a$ and $b$. As we will see, the log det formula does
not satisfy this property for the most general models.
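The conditions on a distance matrix stated above, including the four point condition, can be checked mechanically on a small matrix. The following sketch assumes a plain list-of-lists representation; the function name `is_tree_distance`, the tolerance, and the example edge lengths are hypothetical and serve only as illustration.

```python
from itertools import product

def is_tree_distance(phi, tol=1e-9):
    """Check non-negativity, identity, symmetry and the four point condition."""
    taxa = range(len(phi))
    for a, b in product(taxa, repeat=2):
        if phi[a][b] < -tol or abs(phi[a][b] - phi[b][a]) > tol:
            return False
        if (abs(phi[a][b]) <= tol) != (a == b):
            return False
    # Four point condition over all (not necessarily distinct) a, b, c, d.
    for a, b, c, d in product(taxa, repeat=4):
        bound = max(phi[a][c] + phi[b][d], phi[a][d] + phi[b][c])
        if phi[a][b] + phi[c][d] > bound + tol:
            return False
    return True

# Hypothetical quartet 12|34 with pendant edge lengths 1, 2, 3, 4 and internal edge length 5.
tree_like = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

# The same matrix with one distance perturbed no longer fits any tree.
perturbed = [row[:] for row in tree_like]
perturbed[0][2] = perturbed[2][0] = 12

tree_ok = is_tree_distance(tree_like)
perturbed_ok = is_tree_distance(perturbed)
```

Allowing repeated indices in the quadruple means the triangle inequality is covered by the same loop, so no separate check is needed.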



3.1 The log det formula 



In Figure 2 we consider the two taxa phylogenetic tree, with pattern probabilities given by

$$p_{i_1 i_2} = \sum_{j} \pi_j\, m^{(1)}_{j i_1} m^{(2)}_{j i_2}. \tag{7}$$

By considering the matrices defined as

$$P^{(1,2)} := [p_{i_1 i_2}]_{i_1, i_2 \in \mathcal{K}}, \qquad D_\pi := [\operatorname{diag}(\pi_i)]_{i \in \mathcal{K}};$$

it is easy to show that (7) is equivalent to

$$P^{(1,2)} = M_1^{\top} D_\pi M_2.$$

Taking the determinant of this expression and considering (5) yields

$$\det P^{(1,2)} = \det M_1 \det M_2 \det D_\pi = e^{-(\omega_1 + \omega_2)} \prod_i \pi_i. \tag{8}$$




Figure 3: Phylogenetic tree of three taxa

This expression can be generalized to the case of any two taxa from a given phylogenetic tree:

$$\det P^{(a,b)} = e^{-\omega(a,b)} \prod_i \pi_i^{(a,b)}, \tag{9}$$

where $\pi_i^{(a,b)}$ is the distribution at the most recent ancestral vertex between taxa $a$ and $b$, determined
by the meeting point of the two paths traced backwards along the phylogenetic tree from leaf $a$
and $b$.
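The determinant identity for two taxa can be confirmed numerically. The sketch below assumes three character states and hypothetical Markov matrices `M1`, `M2` and root distribution `pi`; it builds the pattern probabilities from the two-taxon tree and compares $\det P^{(1,2)}$ with $\det M_1 \det M_2 \prod_i \pi_i$.

```python
import math

def det(m):
    """Determinant by Laplace expansion along the first row (fine for small n)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

# Hypothetical root distribution and edge transition matrices (rows sum to one).
pi = [0.2, 0.3, 0.5]
M1 = [[0.90, 0.05, 0.05], [0.10, 0.80, 0.10], [0.05, 0.15, 0.80]]
M2 = [[0.70, 0.20, 0.10], [0.10, 0.85, 0.05], [0.20, 0.10, 0.70]]

# Pattern probabilities p_{i1 i2} = sum_j pi_j m1_{j i1} m2_{j i2}.
P = [[sum(pi[j] * M1[j][a] * M2[j][b] for j in range(3)) for b in range(3)]
     for a in range(3)]

lhs = det(P)
rhs = det(M1) * det(M2) * math.prod(pi)

# The raw logdet value is the stochastic distance plus a composition dependent shift.
logdet_value = -math.log(det(P))
omega = -math.log(det(M1)) - math.log(det(M2))
shift = -sum(math.log(x) for x in pi)
```

Here `logdet_value` equals `omega + shift`, which makes explicit that the logdet value is only linearly related to the stochastic distance through a term that depends on the ancestral base composition.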

Now $\omega(a,b)$ is theoretically equal to the total stochastic distance between each of $a$ and
$b$ and their most recent ancestral vertex and hence it is clear that $-\ln\det P^{(a,b)}$ will be linearly
related to this quantity. In the original formulation of the logdet, a distance measure between two
taxa was defined as

$$d_{ab} := -\ln\det P^{(a,b)} = \omega(a,b) - \sum_i \ln \pi_i^{(a,b)} \tag{10}$$

and shown to satisfy the conditions (6) [?]. From this relation it seems that one can take $\alpha = 1$
and $\beta = -\sum_i \ln \pi_i^{(a,b)}$ and evaluate (10) on the observed pattern frequencies for each pair of taxa
to calculate a well defined distance matrix from a set of aligned sequence data (as was presented in
[12]). This procedure depends crucially upon the shifting term $\beta = -\sum_i \ln \pi_i^{(a,b)}$ being independent
of $a$ and $b$. However, this is only true in special circumstances such as star phylogeny or if the
base composition is constant (the stationary model). In the general case, one is led to a different
shifting term depending on the topology of the tree (this was noted in Sumner and Jarvis [20] and
we reproduce the result here). Consider the phylogenetic tree of three taxa given in Figure 3 with
pattern probabilities given by

$$p_{i_1 i_2 i_3} = \sum_{j,k} \pi_j\, m^{(1)}_{j i_1} m^{(4)}_{jk}\, m^{(2)}_{k i_2} m^{(3)}_{k i_3}.$$

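By calculating (10) for the three possible pairs of taxa we find that

$$\begin{aligned} d_{12} &= (\omega_1 + \omega_4 + \omega_2) - \sum_i \ln \pi_i, \\ d_{13} &= (\omega_1 + \omega_4 + \omega_3) - \sum_i \ln \pi_i, \\ d_{23} &= (\omega_2 + \omega_3) - \sum_i \ln \rho_i. \end{aligned}$$

Here $\rho_i := \sum_j \pi_j m^{(4)}_{ji}$ is the base composition at the internal vertex. From these expressions
it is explicitly clear that the shifting term is not constant across this phylogenetic tree.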
The shifting term is dependent on the base composition at the most recent ancestral node of the 
two taxa and from the above example it is clear that this depends on the topology of the tree and 
is not always simply the root of the tree. This means that (10) does not produce distance matrices
whose entries are linearly related to the edge length of the tree because the entries of the matrix 
will depend essentially upon the topology of the tree. 
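The dependence of the shifting term on the topology can be seen in a small two-state example. The sketch below assumes the rooted three-taxon tree described above (taxon 1 attached to the root by edge 1, taxa 2 and 3 attached to an internal vertex reached by edge 4); all numerical values and helper names are hypothetical.

```python
import math
from itertools import product

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical non-stationary root distribution and edge matrices.
pi = [0.3, 0.7]
M1 = [[0.9, 0.1], [0.2, 0.8]]
M2 = [[0.8, 0.2], [0.1, 0.9]]
M3 = [[0.7, 0.3], [0.2, 0.8]]
M4 = [[0.6, 0.4], [0.1, 0.9]]  # internal edge from the root to the ancestor of taxa 2 and 3

# Pattern probabilities of the rooted three-taxon tree.
p = {}
for i1, i2, i3 in product(range(2), repeat=3):
    p[i1, i2, i3] = sum(pi[j] * M1[j][i1] * M4[j][k] * M2[k][i2] * M3[k][i3]
                        for j in range(2) for k in range(2))

def logdet_distance(a, b):
    """Unmodified logdet value for the leaves at positions a and b (0-based)."""
    P = [[0.0, 0.0], [0.0, 0.0]]
    for idx, prob in p.items():
        P[idx[a]][idx[b]] += prob
    return -math.log(det2(P))

w1, w2, w3, w4 = (-math.log(det2(M)) for M in (M1, M2, M3, M4))
rho = [sum(pi[j] * M4[j][k] for j in range(2)) for k in range(2)]
shift_root = -sum(math.log(x) for x in pi)
shift_internal = -sum(math.log(x) for x in rho)

d12, d13, d23 = logdet_distance(0, 1), logdet_distance(0, 2), logdet_distance(1, 2)
```

With these values the pairs (1,2) and (1,3) carry the shift computed from the root composition, while the pair (2,3) carries the shift computed from the internal vertex, so no single constant $\beta$ fits all three entries.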

It is, however, possible to obtain an estimate of the total stochastic distance between any
two taxa by modifying the logdet formula. The ancestral base composition is approximated by
using the harmonic mean

$$\prod_i \pi_i^{(a,b)} \simeq \Big[ \prod_i \pi_i^{(a)} \pi_i^{(b)} \Big]^{1/2}, \tag{11}$$

where $\pi_i^{(a,b)}$ is the closest common ancestral base composition between taxa $a$ and $b$ and
$\pi_i^{(a)} := \mathbb{P}(X_a(\tau_a) = i)$ (and similarly for $b$). One is then led to the formula

$$d'_{ab} := -\ln\det P^{(a,b)} + \frac{1}{2} \sum_i \big( \ln \pi_i^{(a)} + \ln \pi_i^{(b)} \big), \qquad \forall\, a, b \in L, \tag{12}$$

where $d'_{ab}$ is then an estimator of the total stochastic distance between taxa $a$ and $b$. (This form
of the logdet formula was presented in [?] and [?].)

In the case of a stationary base composition model the additional assumption is made
that

$$\sum_j \pi_j\, m^{(e)}_{ji} = \pi_i, \qquad \forall\, e \in E.$$

In this case we have

$$\pi_i^{(a,b)} = \pi_i^{(a)} = \pi_i^{(b)} = \pi_i, \qquad \forall\, a, b \in L,$$

and it is clear that the harmonic mean approximation becomes an exact relation and the
logdet formula is expected to converge exactly to the total stochastic distance between the two
taxa.
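The behaviour of the modified formula under stationary and non-stationary base composition can be illustrated on a two-state, two-taxon tree. In this sketch the function name `modified_logdet` and all numerical values are hypothetical; the two edge matrices are chosen to share the stationary distribution (2/3, 1/3).

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def modified_logdet(pi, M1, M2):
    """Return (d', omega) for a two-taxon tree with root distribution pi."""
    P = [[sum(pi[j] * M1[j][a] * M2[j][b] for j in range(2)) for b in range(2)]
         for a in range(2)]
    leaf1 = [sum(row) for row in P]        # base composition observed at taxon 1
    leaf2 = [sum(col) for col in zip(*P)]  # base composition observed at taxon 2
    correction = 0.5 * sum(math.log(x) + math.log(y) for x, y in zip(leaf1, leaf2))
    d_prime = -math.log(det2(P)) + correction
    omega = -math.log(det2(M1)) - math.log(det2(M2))
    return d_prime, omega

M1 = [[0.9, 0.1], [0.2, 0.8]]  # stationary distribution (2/3, 1/3)
M2 = [[0.8, 0.2], [0.4, 0.6]]  # same stationary distribution

d_stationary, omega = modified_logdet([2 / 3, 1 / 3], M1, M2)
d_heterogeneous, _ = modified_logdet([0.3, 0.7], M1, M2)
bias = d_heterogeneous - omega
```

When the root distribution is stationary the modified value reproduces the stochastic distance, whereas starting the same chains from a different root composition leaves a residual `bias` that comes entirely from the approximation of the ancestral composition.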



3.2 The tangle 



In this section we will show how the logdet formula can be generalized to obtain, for the most
general Markov models, an unbiased estimate of the distance matrix. The basis of the technique
is the existence of a measure analogous to (9) which is valid for triplets.

Sumner and Jarvis [20] presented a polynomial function $T$ which is known in quantum
physics as the tangle and can be evaluated on phylogenetic data sets of three aligned sequences in
the case of $n = 2$. Evaluated on the pattern probabilities of any phylogenetic tree of three taxa,
$\{a, b, c\}$, the tangle takes on the theoretical value

$$T(a,b,c) = e^{-2\omega(a,b,c)} \prod_{i \in \mathcal{K}} \pi_i^{2}, \tag{13}$$

where

$$\omega(a,b,c) := \sum_{e \in T} \omega_e,$$

$\pi$ is the common ancestral root of the three taxa and this relation holds independently of the
particular tree topology which relates $\{a, b, c\}$. This independence upon the topology is a very
nice property and is crucial to the practical use of the tangle as a distance measure. The similarity
between (13) and (9) should be noted.

In this work we report generalized tangles, which are polynomials which satisfy (13) for
the cases of $n = 3, 4$ in addition to the $n = 2$ case which was presented in [20]. It is possible
to infer the existence of the tangles and derive their polynomial form from group theoretical
considerations. Here we give forms using the completely antisymmetric (Levi-Civita) tensor,
$\epsilon$, which has components $\epsilon_{i_1 i_2 \ldots i_n}$ and satisfies $\epsilon_{12 \ldots n} = 1$. For the cases of $n = 2, 3, 4$ the tangles
are given by$^{3}$

3 This expression for $T_2$ corrects the erroneous expression presented in [20].

$$T_4 = \frac{1}{4!} \sum p_{i_1 j_1 k_1} p_{i_2 j_2 k_2} p_{i_3 j_3 k_3} p_{i_4 j_4 k_4} p_{i_5 j_5 k_5} p_{i_6 j_6 k_6} p_{i_7 j_7 k_7} p_{i_8 j_8 k_8}\, \epsilon_{i_1 i_2 i_3 i_4} \epsilon_{i_5 i_6 i_7 i_8} \epsilon_{j_1 j_5 j_4 j_8} \epsilon_{j_2 j_6 j_3 j_7} \epsilon_{k_1 k_5 k_2 k_6} \epsilon_{k_3 k_7 k_4 k_8},$$

respectively (where the summation is over every index). The expression (13) can be proved by
studying the group theoretical properties of the tangle (see [20]) and by explicitly expanding the
above forms. For the tangle on two characters we find

$$\begin{aligned} T_2 = {} & -p_{122}^2 p_{211}^2 + 2 p_{121} p_{122} p_{211} p_{212} - p_{121}^2 p_{212}^2 + 2 p_{112} p_{122} p_{211} p_{221} + 2 p_{112} p_{121} p_{212} p_{221} \\ & - 4 p_{111} p_{122} p_{212} p_{221} - p_{112}^2 p_{221}^2 - 4 p_{112} p_{121} p_{211} p_{222} + 2 p_{111} p_{122} p_{211} p_{222} \\ & + 2 p_{111} p_{121} p_{212} p_{222} + 2 p_{111} p_{112} p_{221} p_{222} - p_{111}^2 p_{222}^2. \end{aligned}$$

Substantial computer power is required to explicitly compute $T_3$ and $T_4$. These polynomials have
1,152 and 431,424 terms, respectively. The expansions have been achieved by the authors, who can
be contacted to obtain a practical algorithm, or the polynomials themselves.



3.3 Star topology 



Consider the phylogenetic tree relating three taxa with a star topology, with pattern
probabilities given by the formula

$$p_{i_1 i_2 i_3} = \sum_{j} \pi_j\, m^{(1)}_{j i_1} m^{(2)}_{j i_2} m^{(3)}_{j i_3}. \tag{14}$$

Here we will use the fact that the root of this tree is also the common ancestral root of any pair 
of the three taxa. (This is not the case in general if we allow for a general rooting of the tree 
and/or more than three taxa. The complications arising in these cases will be dealt with in the 
next section.) 

Considering the formulae (13) and (9) we are led to introduce the novel distance matrix,
$\Delta$, with the pairwise distance between $\{a, b\}$ given by

$$\Delta^{(c)}_{ab} := -\ln T(a,b,c) + \ln\det P^{(a,c)} + \ln\det P^{(b,c)}, \qquad a, b, c \in L. \tag{15}$$

From (9) and (13) it follows that

$$\Delta^{(c)}_{ab} = \omega(a,b),$$

such that our new formula will directly give the stochastic distance between the two taxa. There is
no need to make the harmonic mean approximation and this distance measure is mathematically
and biologically meaningful. This is the main result of this paper: given a set of aligned sequence
data, the tangle formula (15) can be used to compute the exact pairwise edge lengths for any
triplet. As mentioned above, the explicit polynomial form of the tangle has been computed for
the cases of two, three and four bases and it is our intent that (15) will provide a significant
improvement over the logdet formula in the calculation of pairwise distance matrices for these
cases.
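For two character states the tangle formula can be tried out directly on a star tree. The sketch below makes two assumptions that should be read as such: it evaluates the tangle as Cayley's hyperdeterminant of the 2x2x2 array of pattern probabilities, with the overall sign chosen so that the value is positive on tree data (the sign convention of the printed $T_2$ may differ), and it relies on this polynomial scaling with the square of each edge determinant. The matrices and the root distribution are hypothetical.

```python
import math

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def hyperdet(a):
    """Cayley's hyperdeterminant of a 2x2x2 array a[i][j][k]."""
    a000, a001, a010, a011 = a[0][0][0], a[0][0][1], a[0][1][0], a[0][1][1]
    a100, a101, a110, a111 = a[1][0][0], a[1][0][1], a[1][1][0], a[1][1][1]
    return (a000**2 * a111**2 + a001**2 * a110**2 + a010**2 * a101**2 + a100**2 * a011**2
            - 2 * (a000 * a001 * a110 * a111 + a000 * a010 * a101 * a111
                   + a000 * a100 * a011 * a111 + a001 * a010 * a101 * a110
                   + a001 * a100 * a011 * a110 + a010 * a100 * a011 * a101)
            + 4 * (a000 * a011 * a101 * a110 + a001 * a010 * a100 * a111))

# Hypothetical star tree: non-stationary root distribution and three pendant edges.
pi = [0.3, 0.7]
M1 = [[0.9, 0.1], [0.2, 0.8]]
M2 = [[0.8, 0.2], [0.1, 0.9]]
M3 = [[0.7, 0.3], [0.2, 0.8]]

p = [[[sum(pi[r] * M1[r][i] * M2[r][j] * M3[r][k] for r in range(2))
       for k in range(2)] for j in range(2)] for i in range(2)]

# Pairwise marginals obtained by summing out the remaining taxon.
P12 = [[sum(p[i][j][k] for k in range(2)) for j in range(2)] for i in range(2)]
P13 = [[sum(p[i][j][k] for j in range(2)) for k in range(2)] for i in range(2)]
P23 = [[sum(p[i][j][k] for i in range(2)) for k in range(2)] for j in range(2)]

delta_12 = -math.log(hyperdet(p)) + math.log(det2(P13)) + math.log(det2(P23))
omega_12 = -math.log(det2(M1)) - math.log(det2(M2))
logdet_12 = -math.log(det2(P12))
```

In this example `delta_12` agrees with the stochastic distance `omega_12` without any knowledge of the root composition, while the unmodified `logdet_12` differs from it by the composition dependent shift.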



3.4 Summary 

Considering the stochastic distance to be the correct way to assign edge lengths to branches of a 
phylogenetic tree, we have reviewed three different ways of obtaining a distance measure between 
any two taxa a and b: 

1. $d_{ab} = -\ln\det P^{(a,b)}$

2. $d'_{ab} = -\ln\det P^{(a,b)} + \frac{1}{2} \sum_i \big( \ln \pi_i^{(a)} + \ln \pi_i^{(b)} \big)$

3. $\Delta^{(c)}_{ab} = -\ln T(a,b,c) + \ln\det P^{(a,c)} + \ln\det P^{(b,c)}$

where one substitutes the observed pattern frequencies into these expressions. From the previous
considerations we found that these three distance measures have the following properties:

1. When $d_{ab}$ is evaluated on a set of observed pattern frequencies, this estimator satisfies the
requirements of a distance function (6), but is inconsistent with the general Markov model
as the estimate is not expected to converge to a value that is linearly related to $\omega(a,b)$.

2. When $d'_{ab}$ is evaluated on a set of observed pattern frequencies, this estimator satisfies the
requirements (6) and is expected to converge to a value that is linearly related to $\omega(a,b)$
whenever the compositional heterogeneity is absent. In the heterogeneous case this quantity
approximates $\omega(a,b)$ by using (11).

3. When $\Delta^{(c)}_{ab}$ is evaluated on a set of observed pattern frequencies, this estimator satisfies the
requirements of (6) and is expected to converge exactly to $\omega(a,b)$ in all cases.

Thus we see that the tangle formula (15) should be a significant improvement as an empirical
estimator of $\omega(a,b)$ upon both forms of the logdet formula. However, the formula (15) depends
on taking an arbitrary third taxon, $c$. The question remains as to what to do in the case of
constructing pairwise distances for sets of greater than three taxa. The surprising answer to this
question will be addressed in the next section where we will bring into question the uniqueness of
the theoretical quantity $\omega(a,b)$. The discussion has consequences for the interpretation of each of
the estimators of pairwise distances that we have discussed.



4 Generalized pulley principle 

In this section we generalize Felsenstein's pulley principle [5]. In its original formulation the
pulley principle describes the unrootedness of phylogenetic trees where the underlying Markov
model is assumed to be reversible and stationary. Here we show how the pulley principle may be
generalized to remain valid under the most general Markov models. Our immediate motivation is
to show that (15) remains a valid distance measure under the circumstance of a general
phylogenetic tree of multiple taxa. Unfortunately this generalization introduces surprising mathematical
complications which have consequences not only for our formula (15), but also for the logdet
technique and any other estimate of the stochastic distance upon a phylogenetic tree. The discussion
will lead to the consequence that, for a given tree topology, there are multiple - actually, infinitely
many - phylogenetic trees with identical probability distributions. (These phylogenetic trees differ
by arbitrary rerootings and consequential redirection of edges.) We will see that the generalized
pulley principle shows that as far as inference from the observed pattern frequencies is concerned, 
there is no theoretical justification behind specifying the root of a phylogenetic tree if the most 
general Markov model is allowed. Also, we will see that the theoretical value of the stochastic 
distance is not constant for arbitrary rerootings of a phylogenetic tree. Clearly, if the stochastic 
distance is not uniquely defined theoretically, then one must be careful in interpreting any formula 
which gives an estimate thereof from the observed data. 

Considering a phylogenetic tree as a directed graph shows that a rerooting involves redi- 
recting an edge (or part thereof). The property required is that the Markov chain on the involved 
edge is taken to progress as if time has been reversed, and we refer to the new chain as the 
time-reversed chain. This should be compared to the requirement of reversibility as defined in the 
mathematical literature (for example see [?]). In the case of a stationary and reversible Markov
chain the time-reversed chain (as we will define) is identical to the original chain.

By way of example, we take the rooted tree of three taxa (Figure 3) and redirect the relevant
internal edge to give the following rerooting:





[Diagram: the three-taxon tree rooted at $\pi$ (left) and the same tree rerooted at $\rho$ (right), each with leaves 1, 2, 3.] (16)



Our immediate task is to infer the existence of an appropriate time-reversed Markov chain, $N$,
such that these two phylogenetic trees give identical probability distributions. If we equate the
pattern probabilities of (16) and contract all edges except the one we are reversing, we are led to
the simple algebraic solution

$$n_{ij} = \frac{m_{ji}\, \pi_j}{\rho_i}, \qquad \rho_i = \sum_j \pi_j m_{ji}. \tag{17}$$

(This solution was presented in [22].) Presently we use this result to give an explicit form in the
general case.
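The algebraic solution for the reversed edge can be illustrated on a single edge with two states. The sketch assumes the convention $n_{ij} = m_{ji}\pi_j/\rho_i$ with $\rho = \pi M$, which is the form consistent with the general solution given later in this section; the numerical values are hypothetical.

```python
# Hypothetical distribution at the original root and transition matrix on the reversed edge.
pi = [0.3, 0.7]
M = [[0.6, 0.4], [0.1, 0.9]]
n = len(pi)

# Distribution at the far end of the edge, which becomes the new root.
rho = [sum(pi[j] * M[j][i] for j in range(n)) for i in range(n)]

# Time-reversed transition matrix: n_ij = m_ji * pi_j / rho_i.
N = [[M[j][i] * pi[j] / rho[i] for j in range(n)] for i in range(n)]

# Joint distribution of the two end states, computed in both directions.
joint_forward = [[pi[j] * M[j][k] for k in range(n)] for j in range(n)]
joint_reversed = [[rho[k] * N[k][j] for k in range(n)] for j in range(n)]

# Pushing rho through N recovers the original root distribution.
pi_recovered = [sum(rho[k] * N[k][j] for k in range(n)) for j in range(n)]
```

The rows of `N` sum to one, so it is a valid Markov matrix, and the two joint distributions coincide entry by entry, which is the sense in which the rerooted tree is indistinguishable from the original.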

Given a CTMC $X(t)$ with transition probabilities

$$m_{ij}(s,t) := \mathbb{P}(X(t) = j \mid X(s) = i),$$

we wish to find a second CTMC, $Y(t)$, such that, given any $T > 0$, we have

$$\mathbb{P}(Y(t) = i) = \pi_i(T - t), \qquad \forall\ 0 < t < T.$$

That is, if the direction of time is reversed, the second CTMC $Y(t)$ has identical distribution to
$X(t)$. The uniqueness of $Y(t)$ is a technical matter which we do not consider, because in the
phylogenetic case there are extra restrictions which lead to the unique solution (17).
Considering again the general case, we write

$$n_{ij}(s,t) := \mathbb{P}(Y(t) = j \mid Y(s) = i)$$

and use (17) to infer the general solution

$$n_{ij}(s,t) = \frac{m_{ji}(T-t, T-s)\, \pi_j(T-t)}{\pi_i(T-s)}.$$

It is trivial to show that these transition probabilities satisfy the requirements of a CTMC:

$$\sum_{j} n_{ij}(s,t) = 1, \quad \forall\, i,$$

$$N(u,s)N(s,t) = N(u,t),$$

where $N(s,t) = [n_{ij}(s,t)]_{i,j \in \mathcal{K}}$.

Furthermore, by using the Kolmogorov equations (4) we find that the rate parameters of the
time-reversed chain can be expressed as

$$f_{ij}(s) := \left. \frac{\partial n_{ij}(s,t)}{\partial t} \right|_{t=s} = \frac{q_{ji}(T-s)\, \pi_j(T-s)}{\pi_i(T-s)} - \delta_{ij} \sum_k \frac{q_{ki}(T-s)\, \pi_k(T-s)}{\pi_i(T-s)}.$$

From which it follows that

$$f_{ij}(s) \geq 0, \quad \forall\, i \neq j; \qquad f_{ii}(s) = -\sum_{j \neq i} f_{ij}(s),$$

which confirms that the $f_{ij}(s)$ are a valid set of rate parameters for a CTMC (as expected). It
should be noted that even in the case where $X(t)$ is a homogeneous chain it is certainly not the
case in general that $Y(t)$ is also homogeneous. Consider, however, the stationary and reversible
case, with the respective conditions:

$$\sum_j \pi_j(0)\, q_{ji} = 0, \qquad q_{ij}\, \pi_i(0) = q_{ji}\, \pi_j(0),$$

where the stationarity condition ensures that

$$\pi_i(t) = \pi_i(0), \quad \forall\, t.$$

In this circumstance it follows that

$$f_{ij} = q_{ij},$$

such that Y(t) = X(t) and is hence also stationary and reversible. This was the basis of Felsen- 
stein's initial formulation of the pulley principle - if one considers only stationary and reversible 
Markov chains on a phylogenetic tree, any time-reversed chain is identical to the original Markov 
chain and hence a phylogenetic tree can be arbitrarily rerooted. We have given a continuous time 
generalization of Felsenstein's result which removes the stationary and reversible restriction. 

Equipped with the general solution for $n_{ij}(s,t)$ it is possible to take any phylogenetic tree and find an
alternative tree of identical topology, but rooted in a different place, such that the alternative
tree generates an identical probability distribution to that of the original. This is the basis of our
generalized pulley principle.

The reader should note that we have proven that, under the assumptions of the most
general Markov model, it is not possible to determine the root of a phylogenetic tree by only
considering the probability distribution it generates. Thus, any procedure which determines the 
root from the observed pattern frequencies must be justified by making additional assumptions 
about the underlying stochastic process. 

The curious aspect of the general pulley principle is that the stochastic distance is not 
conserved along the edge of the tree where the directedness was reversed. This is easy to show by 
considering the determinant of (|18|l 

det N(s, t) = det M(T -t,T-s)T[ - *j (18) 

J ".- L lTi(T - s) y > 

Thus the stochastic distance in the reversed time chain is equal to that of the original chain if and 
only if 



n 



m(T - 1) 

^i= h < 19) 



This property of CTMCs and their time-reversed counterparts was observed by Barry and Hartigan. 
It can be seen that in the stationary case (19) will certainly be true. There are other cases 
where (19) may hold, but there does not seem to be any biologically sound way to interpret the 
required condition. In the following discussion we will consider the consequences of the generalized 
pulley principle upon the interpretation of distance matrices. We see that for a given observed 
distribution we can use the generalized pulley principle to show that there are multiple edge length 
assignments using the stochastic distance which are consistent with the Markov model on a 
phylogenetic tree. These edge length assignments differ from one another as a consequence of (18). 
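
As a numerical illustration of (18), the following Python sketch reverses a single edge and 
compares determinants. It is only a sketch under stated assumptions: the transition matrix M is 
taken over a fixed interval and stored row-stochastically (M[j][k] is the probability of ending in 
state k given a start in state j), and the three-state distribution pi, the helper names det and 
reverse_edge, and all numerical values are made up for illustration and do not come from the paper. 

```python
from math import isclose, log, prod


def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-15:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def reverse_edge(pi, M):
    """Return (rho, N): the distribution at the far end of the edge and the
    transition matrix of the reversed edge, obtained from Bayes' rule.

    Assumed convention: M[j][k] = P(end = k | start = j), rows sum to one.
    """
    n = len(pi)
    rho = [sum(pi[j] * M[j][k] for j in range(n)) for k in range(n)]
    N = [[M[k][j] * pi[k] / rho[j] for k in range(n)] for j in range(n)]
    return rho, N


# Illustrative (made-up) non-stationary starting distribution and edge.
pi = [0.5, 0.3, 0.2]
M = [[0.8, 0.1, 0.1],
     [0.2, 0.7, 0.1],
     [0.1, 0.3, 0.6]]
rho, N = reverse_edge(pi, M)

# The joint distribution of (start, end) is the same in both directions.
joint_forward = [[pi[j] * M[j][k] for k in range(3)] for j in range(3)]
joint_reversed = [[rho[k] * N[k][j] for k in range(3)] for j in range(3)]

# Determinant identity: det N = det M * prod_i pi_i / rho_i.
ratio = prod(p / r for p, r in zip(pi, rho))
print("det M          :", det(M))
print("det N          :", det(N))
print("det M * ratio  :", det(M) * ratio)
print("-ln det M      :", -log(det(M)))
print("-ln det N      :", -log(det(N)))
```

Because pi is not stationary for M in this example, the product of ratios differs from one, so the 
two directions of the same edge carry different stochastic distances even though the joint 
distributions coincide. 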



4.1 Interpretation 



For illustrative purposes we consider the consequence to the stochastic distance of the rerooting 
of a phylogenetic tree of two taxa. We consider the phylogenetic trees illustrated in Figure 4 and, 
by using the generalized pulley principle, define their respective transition matrices so that their 
probability distributions are identical: 



Pi X!"'"'"- 

j 

We find that in the first case we have 

$$ \omega_\sigma(1,2) = -\ln\det M_1 - \ln\det M - \ln\det M_2, $$

and in the second case 

$$ \omega_\rho(1,2) = -\ln\det M_1 - \ln\det N - \ln\det M_2. $$

Now in general $\det M \neq \det N$, so the two possible pairwise distances are not expected 
to be equal. However, from an empirical perspective it is impossible to distinguish these two 
theoretical scenarios because the probability distributions are identical. Because any 
estimator of the pairwise distance must be inferred from the observed distribution, we conclude 
that one must be careful to consider exactly what theoretical quantity one is obtaining an estimate 
of. For the case of the log det formula we find that the quantity it is estimating depends 
essentially upon the base composition of the observed sequences as follows: 

Figure 4: Using the generalized pulley principle. 



Consider the pairwise distance $d'_{ab}$ given by (12). From the generalized pulley principle we 
see that this formula will give an estimate of the stochastic distance between a and b, where the 
common ancestral node is placed such that the quantity 

i 

2 

(20) 

is minimized. Thus the logdet method will be inconsistent in the sense that, if there has been 
compositional heterogeneity, the pairwise distance it produces will be an estimate for the edge 
length assignment where $\chi(a,b)$ is minimized. This may have nothing to do with the true 
placement of the common ancestral vertex, and it may even be the case that $\chi(a,b)$ has multiple 
minimum points. The situation amounts to the fact that, for a given phylogenetic tree, one is 
(potentially) using the logdet to estimate pairwise distances with a different edge length 
assignment for each and every pair of taxa. Clearly for the analysis of multiple taxa this could 
become a significant problem, and any alternative approach which removes this inconsistency would 
be beneficial to the analysis. 

We see that the consequences of the generalized pulley principle and (18) to the interpretation 
of the Markov model of phylogenetics are quite subtle. The generalized pulley principle is 
telling us that there is no direct way to distinguish the rootedness (and equivalently the direct- 
edness of internal edges) of phylogenetic trees. This is due to the fact that there are (infinitely) 
many phylogenetic trees of identical topology which generate identical probability distributions, 
differing only by the assignment of stochastic distance and the associated redirection of internal 
edges. 
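
The two-taxon situation described above can be checked numerically. The sketch below is an 
illustration only: it assumes a two-state alphabet, row-stochastic matrices (M[j][k] is the 
probability of state k at the far end of an edge given state j at its start), and made-up values 
for pi, M1, M and M2; the function names are hypothetical. The internal edge is directed one way 
with matrix M and root distribution pi, or the other way with matrix N and root distribution rho. 

```python
from math import isclose, log


def det2(a):
    """Determinant of a 2 x 2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


# Illustrative (made-up) parameters for a two-taxon tree.
pi = [0.7, 0.3]                       # root distribution, first rooting
M1 = [[0.90, 0.10], [0.20, 0.80]]     # pendant edge to taxon 1
M = [[0.60, 0.40], [0.10, 0.90]]      # internal edge, first rooting
M2 = [[0.85, 0.15], [0.30, 0.70]]     # pendant edge to taxon 2

# Reverse the internal edge with Bayes' rule.
rho = [sum(pi[j] * M[j][k] for j in range(2)) for k in range(2)]
N = [[M[k][j] * pi[k] / rho[j] for k in range(2)] for j in range(2)]


def joint_first(i1, i2):
    """P(taxon 1 = i1, taxon 2 = i2) with the root at the taxon-1 end."""
    return sum(pi[j] * M1[j][i1] * M[j][k] * M2[k][i2]
               for j in range(2) for k in range(2))


def joint_second(i1, i2):
    """The same probability with the root moved to the taxon-2 end."""
    return sum(rho[k] * N[k][j] * M1[j][i1] * M2[k][i2]
               for j in range(2) for k in range(2))


# Stochastic distances for the two edge length assignments.
omega_sigma = -log(det2(M1)) - log(det2(M)) - log(det2(M2))
omega_rho = -log(det2(M1)) - log(det2(N)) - log(det2(M2))

for i1 in range(2):
    for i2 in range(2):
        print(i1, i2, joint_first(i1, i2), joint_second(i1, i2))
print("omega_sigma(1,2):", omega_sigma)
print("omega_rho(1,2)  :", omega_rho)
```

The two rootings print the same pattern probabilities but different pairwise distances, which is 
the sense in which the observed distribution alone cannot single out one edge length assignment. 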



X(a,b) \\ 



n 



5 The quartet case 



In this section we will show that in the case of a phylogenetic tree of four taxa, the tangle 
can be used to construct consistent quartet distance matrices. These distance matrices will be 
consistent in the sense that theoretically they are constructed from one topology with one edge 
length assignment. This should be contrasted with the logdet formula, which in the general case can 
be estimating a different edge length assignment for each and every pairwise distance. 

For analytic purposes we use the generalized pulley principle to root the four taxon tree 
in two ways, as illustrated in Figure 5. The difference between the two cases is simply in the 
directedness of the internal edge, and the generalized pulley principle allows us to calculate the 
required transition probabilities so that the two trees generate identical probability distributions. 
The pattern probabilities for the two cases are given by 

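$$
\begin{aligned}
p_{i_1 i_2 i_3 i_4} &= \sum_{j,k} \pi_j\, m^{(5)}_{jk}\, m^{(1)}_{j i_1}\, m^{(2)}_{j i_2}\, m^{(3)}_{k i_3}\, m^{(4)}_{k i_4} \\
&= \sum_{j,k} \rho_j\, n^{(5)}_{jk}\, m^{(1)}_{k i_1}\, m^{(2)}_{k i_2}\, m^{(3)}_{j i_3}\, m^{(4)}_{j i_4},
\end{aligned}
\tag{21}
$$

where to ensure the equality of the two expressions we have 

$$ n^{(5)}_{jk} = \frac{m^{(5)}_{kj}\,\pi_k}{\rho_j} $$

and $\rho_l = \sum_j \pi_j\, m^{(5)}_{jl}$. 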

From these expressions we wish to calculate the theoretical values of the formula (15) for 
each possible group of three taxa. To obtain these values one simply chooses the form of the tree 
such that after the deletion of a fourth taxon one is left with a three taxon tree of star topology. By 
sequentially deleting one taxon at a time we are led to the four star topology subtrees illustrated 
in Figure 6, and the corresponding pattern probabilities are given by the expressions 

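$$
\begin{aligned}
p^{(123)}_{ijk} &= \sum_{h,l} \pi_h\, m^{(1)}_{hi}\, m^{(2)}_{hj}\, m^{(5)}_{hl}\, m^{(3)}_{lk}, \\
p^{(124)}_{ijk} &= \sum_{h,l} \pi_h\, m^{(1)}_{hi}\, m^{(2)}_{hj}\, m^{(5)}_{hl}\, m^{(4)}_{lk}, \\
p^{(134)}_{ijk} &= \sum_{h,l} \rho_h\, n^{(5)}_{hl}\, m^{(1)}_{li}\, m^{(3)}_{hj}\, m^{(4)}_{hk}, \\
p^{(234)}_{ijk} &= \sum_{h,l} \rho_h\, n^{(5)}_{hl}\, m^{(2)}_{li}\, m^{(3)}_{hj}\, m^{(4)}_{hk}.
\end{aligned}
$$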

From this it is easy to calculate the values simply by considering the results of the previous section: 

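$$
\begin{aligned}
\Delta^{(3)}_{12} &= \omega(1,2), & \Delta^{(4)}_{12} &= \omega(1,2), \\
\Delta^{(2)}_{13} &= \omega_\sigma(1,3), & \Delta^{(4)}_{13} &= \omega_\rho(1,3), \\
\Delta^{(2)}_{14} &= \omega_\sigma(1,4), & \Delta^{(3)}_{14} &= \omega_\rho(1,4), \\
\Delta^{(1)}_{23} &= \omega_\sigma(2,3), & \Delta^{(4)}_{23} &= \omega_\rho(2,3), \\
\Delta^{(1)}_{24} &= \omega_\sigma(2,4), & \Delta^{(3)}_{24} &= \omega_\rho(2,4), \\
\Delta^{(1)}_{34} &= \omega(3,4), & \Delta^{(2)}_{34} &= \omega(3,4),
\end{aligned}
\tag{22}
$$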



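where 

$$
\begin{aligned}
\omega(a,b) &= \omega_a + \omega_b, \\
\omega_\sigma(a,b) &= \omega_a + \omega_m + \omega_b, \\
\omega_\rho(a,b) &= \omega_a + \omega_n + \omega_b, \\
\omega_m &= -\ln\det M, \\
\omega_n &= -\ln\det N;
\end{aligned}
$$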

and we have made use of (18) in the form 

$$ \det N = \det M \prod_i \frac{\pi_i}{\rho_i}. $$

We see that for any two taxa we have two options for assigning a pairwise distance. In 
the cases of the pairs (12) and (34) we see that either choice is consistent with the other, whereas 
in the case of the pairs (13), (14), (23) and (24) the two choices lead to an inconsistent assignment 
of the internal edge length upon the tree. Effectively what is happening here is that for a four 
taxon tree there are two possible edge length assignments for the internal edge and, for a given 
pair of taxa (ab) and third taxon c, the tangle formula (15) is estimating the distance between a 
and b by assigning one of the two possible edge lengths to the internal edge depending on the 
topology of the tree. 

Figure 6: Three taxon subtrees. 

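It is possible to eliminate this inconsistency by using either a max or min criterion in the 
construction of the distance matrix: 

$$ d^{\max}_{ab} := \max\left\{ \Delta^{(c)}_{ab},\, \Delta^{(c')}_{ab} \right\} $$

or 

$$ d^{\min}_{ab} := \min\left\{ \Delta^{(c)}_{ab},\, \Delta^{(c')}_{ab} \right\}, $$

where c and c' denote the two taxa other than a and b. 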



By making one of these choices to construct a distance matrix one has chosen the directedness of 
the internal edge of the phylogenetic tree (Figure 5) consistently. This procedure leads to an 
improvement in consistency over the logdet technique for the construction of quartet phylogenetic 
distance matrices. It is hoped that this technique can be used fruitfully to improve the 
reconstruction of phylogenetic quartets, which can be used as a first step in the reconstruction 
of large phylogenetic 
trees |3ll2I]. 
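
The effect of the max or min criterion can be sketched on theoretical values. The Python fragment 
below does not implement the tangle formula (15); it simply assumes the two candidate values per 
pair listed in (22) for the quartet 12|34, with made-up pendant weights w and internal weights w_m 
and w_n, and hypothetical function names. It then checks the four-point condition (the two largest 
of the three pairwise sums are equal), which holds when every pair uses the same internal edge 
length. 

```python
from itertools import combinations
from math import isclose


def candidates(w, w_m, w_n):
    """Two candidate distances per pair of taxa on the quartet 12|34.

    Pairs on the same side of the internal edge get the same value twice;
    pairs that cross it get one value with w_m and one with w_n.
    """
    same_side = {(1, 2), (3, 4)}
    out = {}
    for a, b in combinations((1, 2, 3, 4), 2):
        base = w[a] + w[b]
        if (a, b) in same_side:
            out[(a, b)] = (base, base)
        else:
            out[(a, b)] = (base + w_m, base + w_n)
    return out


def quartet_matrix(w, w_m, w_n, criterion=max):
    """Apply the max (or min) criterion to every pair."""
    return {pair: criterion(vals)
            for pair, vals in candidates(w, w_m, w_n).items()}


def four_point_holds(d):
    """True if the two largest of the three pairwise sums coincide."""
    sums = sorted([d[(1, 2)] + d[(3, 4)],
                   d[(1, 3)] + d[(2, 4)],
                   d[(1, 4)] + d[(2, 3)]])
    return isclose(sums[1], sums[2])


# Illustrative (made-up) edge weights.
w = {1: 0.10, 2: 0.20, 3: 0.15, 4: 0.25}
w_m, w_n = 0.30, 0.50

d_max = quartet_matrix(w, w_m, w_n, max)
d_min = quartet_matrix(w, w_m, w_n, min)

# A mixed choice: pair (1,3) uses w_m while the other crossing pairs use w_n.
d_mixed = {pair: (vals[0] if pair == (1, 3) else vals[1])
           for pair, vals in candidates(w, w_m, w_n).items()}

print("max criterion  :", four_point_holds(d_max))
print("min criterion  :", four_point_holds(d_min))
print("mixed selection:", four_point_holds(d_mixed))
```

With estimated rather than theoretical values the equality would only hold approximately, so a 
practical check would need a tolerance chosen for the data at hand. 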



6 Conclusion 



In this paper we have given a review of the standard assignment of branch weights to phylogenetic 
trees, reviewed the use of the logdet formula as an estimator of pairwise distances and shown how 
a previously unknown polynomial, the tangle, can be used to construct an improved estimator. 
We have generalized Felsenstein's pulley principle and used this result to show exactly how the 
distance matrix estimates become inconsistent when applied to the reconstruction problem of 
multiple taxa. We have shown that the tangle formula along with a max/min criterion can be 
used to remove this inconsistency and construct consistent quartet distance matrices. 